# Word Vectors in Python with gensim

This notebook shows an example of how to use word vectors in Python with gensim.

## Load Libraries

We're using a new library called `gensim`.  It's a great library for modeling text and comes with pre-trained models that you can easily use in other contexts.

In [0]:
%matplotlib inline
import numpy as np
import gensim
import matplotlib.pyplot as plt
import seaborn as sns

from collections import defaultdict

### Cleaning Our Corpus

(corpus is another name for the set of documents under consideration)

In [0]:
documents = [
    "Perfect is the enemy of good.",
    "I'm still learning.",
    "Life is a journey, not a destination.",
    "Learning is not attained by chance, it must be sought for with ardor and attended to with diligence.",
    "Yesterday I was clever, so I changed the world. Today I am wise, so I am changing myself.",
    "Be curious, not judgmental.",
    "You don't have to be great to start, but you have to start to be great.,"
    "Be stubborn about your goals and flexible about your methods.",
    "Nothing will work unless you do.",
    "Never give up on a dream just because of the time it will take to accomplish it. The time will pass anyway.",
    "Anyone who stops learning is old, whether at twenty or eighty.",
    "Tell me and I forget. Teach me and I remember. Involve me and I learn.",
    "Change is the end result of all true learning.",
    "Live as if you were to die tomorrow. Learn as if you were to live forever.",
    "A learning curve is essential to growth.",
]

# remove common words and tokenize
stop_words = set('for a of the and to in'.split())
texts = [[word for word in document.lower().replace("'", "").replace(".", "").split() if word not in stop_words]
         for document in documents]

## remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
texts

[['is'],
 ['learning'],
 ['is', 'not'],
 ['learning', 'is', 'not', 'it', 'be', 'with', 'with'],
 ['i', 'so', 'i', 'i', 'am', 'so', 'i', 'am'],
 ['be', 'not'],
 ['you', 'have', 'be', 'you', 'have', 'be', 'about', 'your', 'about', 'your'],
 ['will', 'you'],
 ['time', 'it', 'will', 'it', 'time', 'will'],
 ['learning', 'is'],
 ['me', 'i', 'me', 'i', 'me', 'i', 'learn'],
 ['is', 'learning'],
 ['live',
  'as',
  'if',
  'you',
  'were',
  'learn',
  'as',
  'if',
  'you',
  'were',
  'live'],
 ['learning', 'is']]

### Creating our Word2Vec Model

`gensim` makes it easy to train a Word2Vec model.  All training requires is passing in the corpus.

In [0]:
model = gensim.models.Word2Vec(texts, size=10, window=2, min_count=1)
model

<gensim.models.word2vec.Word2Vec at 0x7ffad4bcb940>

In [0]:
model['live']

  """Entry point for launching an IPython kernel.


array([ 0.04093517, -0.00830318, -0.03466499, -0.03480424, -0.00121745,
       -0.01612454,  0.02507611, -0.02705088,  0.01547491,  0.02742862],
      dtype=float32)

And we can find the most similar words too.  Obviously, our dataset is too small and we won't find anything too interesting.

In [0]:
model.most_similar('live')

  """Entry point for launching an IPython kernel.
  if np.issubdtype(vec.dtype, np.int):


[('time', 0.4510110020637512),
 ('me', 0.3452457785606384),
 ('as', 0.27253618836402893),
 ('with', 0.2470661848783493),
 ('will', 0.20179931819438934),
 ('have', 0.17683294415473938),
 ('learning', 0.13080498576164246),
 ('if', 0.12683644890785217),
 ('were', 0.0977015271782875),
 ('learn', 0.017691776156425476)]

### Loading an existing corpus

We can load some existing text and train a model on it.  In this case, we're going to use `text8` which is a small subset of Wikipedia (31MB).  See: https://github.com/RaRe-Technologies/gensim-data

In [0]:
import gensim.downloader as api

# This could be slow to download...
corpus = api.load("text8")
corpus



<text8.Dataset at 0x7ffad46f2438>

In [0]:
# Using small numbers here, probably want to use a bigger corpus, bigger dimensions, and more iterations.
model = gensim.models.Word2Vec(corpus, size=10, window=2, iter=5, min_count=1)
model

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


<gensim.models.word2vec.Word2Vec at 0x7ffad46ea588>

We can get slightly better results (but we really should be using a much bigger corpus)

In [0]:
model.most_similar("queen")

  """Entry point for launching an IPython kernel.
  if np.issubdtype(vec.dtype, np.int):


[('tsar', 0.9781045913696289),
 ('king', 0.9707283973693848),
 ('roosevelt', 0.9589150547981262),
 ('alsacian', 0.9491715431213379),
 ('nixdorff', 0.9476499557495117),
 ('prince', 0.9449954032897949),
 ('churchill', 0.9445421695709229),
 ('hadrian', 0.9420450925827026),
 ('vampyr', 0.9411798119544983),
 ('maccabess', 0.9398282170295715)]

In [0]:
model['queen']

  """Entry point for launching an IPython kernel.


array([ 2.6956127 ,  1.2827659 ,  2.3695884 ,  0.6383235 , -2.2322483 ,
        0.37769857,  2.3656695 ,  0.7270439 ,  3.1721518 ,  0.6553371 ],
      dtype=float32)

### Loading a pre-trained model

We can also use the gensim to automatically download and load a pre-trained model, or alternatively load it from disk.  Since the pre-trained model has much more data, the vectors encode some semantic meaning.

In [0]:
# Glove is another word embedding that uses a slightly different technique than word2vec, 
# however, it has the same properties and API
model = api.load("glove-wiki-gigaword-50")
model



In [0]:
model.most_similar('queen')

Alternatively, we can load the same model directly from disk (the previous calls cache the files in `~/gensim-data/`.

In [0]:
model = gensim.models.KeyedVectors.load_word2vec_format('~/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz')
model

In [0]:
model.most_similar('queen')

In [0]:
model['queen']

### Visualizing Words with T-SNE

As we saw in the slides, we can visualize the distance between words using T-SNE.

In [0]:
from sklearn.manifold import TSNE

In [0]:
# Gather a listing of random words
words = ['queen', 'princess', 'dog', 'king', 'cat', 'obama', 'clinton', 'president', 'math', 'brian']

vecs = np.array([model[w] for w in words])
vecs

In [0]:
vecs_tsne = TSNE(n_components=2, perplexity=3).fit_transform(vecs)
vecs_tsne

In [0]:
ax = sns.scatterplot(vecs_tsne[:, 0], vecs_tsne[:, 1], s=100)
sns.set(font_scale=1.5)
[ax.text(p[0], p[1]+10, word, color='black') for word, p in zip(words, vecs_tsne)]
pass